Abstract. In this work we study the problem of scheduling tasks with dependencies in multiprocessor architectures
where processors have different speeds. We present the preemptive algorithm "Save-Energy" that, given a schedule
of tasks, post-processes it to improve the energy efficiency without any deterioration of the makespan. In terms
of time efficiency, we show that preemptive scheduling in an asymmetric system can achieve the same or better
optimal makespan than in a symmetric system. Motivated by real multiprocessor systems, we investigate architectures
that exhibit limited asymmetry: there are two essentially different speeds. Interestingly, this special case has
not been studied in the field of parallel computing and scheduling theory; only the general case was studied where
processors have K essentially different speeds. We present the non-preemptive algorithm "Remnants" that achieves
almost optimal makespan. We provide a refined analysis of a recent scheduling method. Based on this analysis, we
specialize the scheduling policy and provide an algorithm of (3 + o(1)) expected approximation factor. Note that
this improves the previous best factor (6 for two speeds). We believe that our work will convince researchers to
revisit this well studied scheduling problem for these simple, yet realistic, asymmetric multiprocessor architectures.



1 Introduction 



It is clear that processor technology is undergoing a vigorous shake-up to allow one processor
socket to provide access to multiple logical cores. Current technology already allows multiple
processor cores to be contained inside a single processor module. Such chip multiprocessors seem to
overcome the thermal and power problems that limit the performance that single-processor chips
can deliver. Recently, researchers have proposed multiprocessor platforms where individual
processors have different computation capabilities (e.g., see [6]). Such architectures are attractive
because a few high-performance complex processors can provide good serial performance, and
many low-performance simple processors can provide high parallel performance. Such asymmetric
platforms can also achieve energy efficiency since the lower the processing speed, the lower
the power consumption is [10]. Reducing the energy consumption is an important issue not only
for battery operated mobile computing devices but also for desktop computers and servers.

As the number of chip multiprocessors is growing tremendously, the need for algorithmic
solutions that efficiently use such platforms is increasing as well. In these platforms a key assumption
is that processors may have different speeds and capabilities but that the speeds and capabilities do
not change. We consider multiprocessor architectures $P = \{p_k : k = 1, \ldots, m\}$, where $c(k)$ is the
speed of processor $p_k$. The total processing capability of the platform is denoted by $p = \sum_{k=1}^{m} c(k)$.

One of the key challenges of asymmetric computing is the scheduling problem. Given a parallel
program of n tasks represented as a dependence graph, the scheduling problem deals with mapping
each task onto the available asymmetric resources in order to minimize the makespan, that is,
the maximum completion time of the jobs. In this work we also look into how to reduce energy
consumption without affecting the makespan of the schedule. Energy efficiency for speed scaling
of parallel processors, which is not assumed in this work, was considered in [1].

Our notion of a parallel program to be executed is a set of $n$ tasks represented by $G = (V, E)$,
a Directed Acyclic Graph (DAG). The set $V$ represents $n = |V|$ simple tasks, each of unit
processing time. If task $i$ precedes task $j$ (denoted via $i \prec j$), then $j$ cannot start until $i$ is finished.
The set $E$ of edges represents precedence constraints among the tasks. We assume that the whole
DAG is presented as an input to the multiprocessor architecture. Our objective is to give schedules
that complete the processing of the whole DAG in small time. Using terminology from scheduling
theory, the problem is that of scheduling precedence-constrained tasks on related processors to
minimize the makespan. In our model the speed asymmetry is the basic characteristic. We assume
that the overhead of (re)assigning processors to tasks of a parallel job to be executed is negligible.

The special case in which the DAG is just a collection of chains is of importance because 
general DAGs can be scheduled via a maximal chain decomposition technique of [4]. Let
$L = \{L_1, L_2, \ldots, L_r\}$ be a program of $r$ chains of tasks to be processed. We denote the length of chain
$L_i$ by $l_i = |L_i|$, the count of the jobs in $L_i$; without loss of generality $l_1 \ge l_2 \ge \ldots \ge l_r$.
Clearly $n = \sum_{i=1}^{r} l_i$. In this case the problem is also known as Chains Scheduling. Note that
the decomposition technique of [4] requires O (n^) time and the maximal chain decomposition 
depends only on the jobs of the given instance and is independent of the machine environment. 

Because the problem is NP-hard [9] even when all processors have the same speed, the scheduling
community has concentrated on developing approximation algorithms for the makespan. Early
papers introduce $O(\sqrt{m})$-approximation algorithms [7, 8], and more recent papers propose $O(\log m)$-
approximation algorithms [5, 4]. Numerous asymmetric processor organizations have been proposed
for power and performance efficiency, and the behavior of multi-programmed single-threaded
applications on them has been investigated. The authors of [2] investigate the impact of performance
asymmetry in emerging multiprocessor architectures. They conduct an experimental study on a
multiprocessor system where individual processors have different performance. They report that the
running times of commercial applications may benefit from such performance asymmetry.

Previous research assumed the general case where multiprocessor platforms have $K$ distinct
speeds. Yet recent technological advances (e.g., see [6, 10]) build systems of two essential speeds.
Unfortunately, in the scheduling literature, the case of just 2 distinct processor speeds has not been
given much attention. In fact, the best results to date, those of [4], reduce instances of arbitrary (but
related) speeds to at most $K = O(\log m)$ distinct speeds. Then the same work gives schedules of
a makespan at most $O(K)$ times the optimal makespan, where $O(K)$ is $6K$ for general DAGs. We
consider architectures of chip multiprocessors consisting of $m$ processors, with $m_s$ fast processors
of speed $s > 1$ and $m - m_s$ slow processors of speed 1, where the energy consumption per unit
time is a convex function of the processor speed. Thus, our model is a special case of the uniformly
related machines case, with only two distinct speeds. In fact, the notion of distinct speeds used in
[5] and [4] allows several speeds for our model, but not differing much from each other. So for the
case of 2 speeds, considered here, this gives a 12-factor approximation for general DAGs. Our goal
here is to improve on this and, under this simple model, provide schedules with better makespan.
We also focus on the special case where the multiprocessor system is composed of a single fast
processor and multiple slow processors, like the one designed in [6]. Note that [3] has recently
worked on a different model that assumes asynchronous processors with time varying speeds.



2 Energy Efficiency of Scheduling on Asymmetric Multiprocessors 

Asymmetric platforms can achieve energy efficiency since the lower the processing speed, the
lower the power consumption is [10]. Reducing energy consumption is important not only for battery
operated mobile computing devices but also for desktop computers and servers. To examine the energy
usage of multiprocessor systems we adopt the model of [1]: the energy consumption per unit time
is a convex function of the processor speed. In particular, the energy consumption of processor $k$
is proportional to $c(k)^{\alpha} \cdot t$, where $\alpha > 1$ is a constant. Clearly, by increasing the makespan of a
schedule we can reduce the energy usage.

We design the preemptive algorithm "Save-Energy" (see Alg. 1) that post-processes a schedule
of tasks to processors in order to improve the energy efficiency by reassigning tasks to processors
of slower speed. We assume no restrictions on the number of speeds of the processors and rearrange
tasks so that the makespan is not affected. This reduces the energy consumption since in our model
the energy spent to process a task is proportional to the speed of the processor to the power of
$\alpha$ (where $\alpha > 1$). In this sense, our algorithm will optimize a given schedule so that maximum
energy efficiency is achieved.



Input: An assignment of tasks to processors
Output: An assignment of tasks to processors with reduced energy consumption

Split schedule in intervals $t_j$, where $j \in [1 \ldots \tau_0]$
Sort times in ascending order.
$\tau \leftarrow \tau_0$
for $c = c(2)$ to $c(m)$ do
    for $i \leftarrow 1$ to $\tau$ do
        $H \leftarrow$ holes in lower speeds that processing of $t_i$ can fit without conflict in other assignments
        Fit task $l_i$ in as many slower speeds starting from holes at $c(1)$ to $c$, but if at $t_i$, $l_i$ fits to 2 or more speeds fill the [...]
        if $l_i$ does not fit exactly then
            Create a new $t'$ at the time preemption happens
            Fit $l_i$ in extended slot
        end
    end
    $\tau \leftarrow \tau_{(previous)}$ + set of times at which preemption occurred
end

Algorithm 1: "Save-Energy"

We start by sorting the processors according to the processing capability $p_1, \ldots, p_m$ so that
$c(1) \ge c(2) \ge \ldots \ge c(m)$. We then split time in intervals $t_j$, where $j \in [1 \ldots \tau_0]$, where $\tau_0$ is such
that within these intervals there is not any preemption, no task completes and no changes are
made to the precedence constraints. Furthermore we denote $x_i^j = 1$ if at $t_j$ we use $c(i)$ and $0$
otherwise. So the total energy consumption of the schedule is $E = \sum_{i=1}^{m} \sum_{j=1}^{\tau_0} x_i^j \, c(i)^{\alpha} \, t_j$.
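The total energy defined above can be evaluated directly. The following is a minimal Python sketch under stated assumptions: the function name `schedule_energy` and the list-based encoding of the indicators (`usage[i][j]` for processor i in interval j) are illustrative choices, not part of the original algorithm.

```python
def schedule_energy(speeds, durations, usage, alpha):
    """Energy E = sum_i sum_j x_i^j * c(i)^alpha * t_j.

    speeds[i]    -- c(i), speed of processor i
    durations[j] -- length of interval t_j
    usage[i][j]  -- 1 if processor i is busy during interval j, else 0
    alpha        -- exponent of the convex power function (alpha > 1)
    """
    total = 0.0
    for i, c in enumerate(speeds):
        for j, t in enumerate(durations):
            total += usage[i][j] * (c ** alpha) * t
    return total


# Hypothetical instance: two processors (speeds 2 and 1),
# two unit-length intervals, alpha = 2.
busy_fast = schedule_energy([2, 1], [1, 1], [[1, 1], [0, 0]], alpha=2)
busy_slow = schedule_energy([2, 1], [1, 1], [[0, 0], [1, 1]], alpha=2)
```

In this toy instance the same two intervals cost four times as much energy on the speed-2 processor as on the speed-1 processor, although the faster processor only completes twice the work in that time.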

Theorem 1 (Condition of optimality). If $E$ is the optimal energy consumption of a schedule (i.e.,
no further energy savings can be achieved), the following holds: there do not exist any $t_i, t_j$,
where $i, j \in [1 \ldots \tau_0]$, so that a list $l$ initially assigned to speed $c(u)$ at time $t_i$ can be rescheduled
to $t_j$ with speed $c(v) \ne c(u)$ and reduce energy.

Proof. Suppose that $E$ is optimal and yet we can reduce the energy $E$ of the schedule. We obtain a
contradiction. We can assume without any loss of generality that there exists at time $t_i$ a core $u$ that
processes a list at speed $c(u)$ and there is a $t_j$ so that we can reschedule it to processor $v$ with speed
$c(v) < c(u)$. This is so because if $c(v) > c(u)$ we will not have energy reduction. It therefore suffices
to show that, whenever such $t_i, t_j$ exist, the new energy $E'$ is lower than $E$. There exist only three
cases when we try to reschedule a list $l$ from $t_i$ to $t_j$ from $c(u)$ to $c(v)$, where $c(v) < c(u)$:

(1) The processing of $l$ at $t_i$ fits exactly to $t_j$. This is the case when $t_i \cdot c(u) = t_j \cdot c(v)$. In this case
$E' = E - c(u)^{\alpha} t_i + c(v)^{\alpha} t_j$. Then $E' < E$, which contradicts the optimality of $E$, since
$\frac{c(v)^{\alpha} t_j}{c(u)^{\alpha} t_i} < 1$ because
$\frac{c(v)^{\alpha}}{c(u)^{\alpha}} \cdot \frac{t_j}{t_i} = \frac{c(v)^{\alpha}}{c(u)^{\alpha}} \cdot \frac{c(u)}{c(v)} = \left(\frac{c(v)}{c(u)}\right)^{\alpha - 1} < 1$ (recall that $\alpha > 1$).

(2) The processing of $l$ at $t_i$ fits to $t_j$ and there remains time at $t_j$. This is the case when
$t_i \cdot c(u) < t_j \cdot c(v)$. Again we reach a contradiction since the new energy is the same as in the previous
case: since $t_i \cdot c(u) < t_j \cdot c(v)$, there exists $t'_j < t_j$ so that $t_i \cdot c(u) = t'_j \cdot c(v)$.

(3) The processing of $l$ at $t_i$ does not fit completely to $t_j$. This is the case when $t_i \cdot c(u) > t_j \cdot c(v)$. Now
we cannot move all the processing of $l$ from $t_i$ to $t_j$. So there exists $t'_i$ so that $c(u) \cdot t'_i = c(v) \cdot t_j$.
So the processing splits in two: it stays at time $t_i$ for $t_i - t'_i$ and fills $t_j$ completely. The change in energy
is $c(u)^{\alpha} \cdot (t_i - t'_i) + c(v)^{\alpha} \cdot t_j - c(u)^{\alpha} \cdot t_i = c(v)^{\alpha} \cdot t_j - c(u)^{\alpha} \cdot t'_i < 0$ because
$\frac{c(v)^{\alpha} t_j}{c(u)^{\alpha} t'_i} < 1 \Leftrightarrow \frac{c(v)^{\alpha}}{c(u)^{\alpha}} \cdot \frac{t_j}{t'_i} < 1 \Leftrightarrow \frac{c(v)^{\alpha}}{c(u)^{\alpha}} \cdot \frac{c(u)}{c(v)} < 1$, which proves the theorem.

□
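Case (1) of the proof can be checked numerically. The sketch below is illustrative only: the function name `energy_ratio_exact_fit` and the sample speeds are assumptions, chosen to confirm that moving a fixed amount of work to a slower processor scales its energy by $(c(v)/c(u))^{\alpha-1}$.

```python
def energy_ratio_exact_fit(c_u, c_v, alpha, t_i=1.0):
    """Ratio E_new / E_old when the work done at speed c_u during t_i is
    moved to speed c_v for a duration t_j with t_i * c_u == t_j * c_v."""
    t_j = t_i * c_u / c_v            # same amount of work at the lower speed
    old = (c_u ** alpha) * t_i       # energy spent at the faster processor
    new = (c_v ** alpha) * t_j       # energy spent at the slower processor
    return new / old


# Hypothetical values: c(u) = 4, c(v) = 2, alpha = 3.
ratio = energy_ratio_exact_fit(c_u=4.0, c_v=2.0, alpha=3.0)
closed_form = (2.0 / 4.0) ** (3.0 - 1.0)
```

Both quantities agree, and the ratio stays below 1 for any $c(v) < c(u)$ and $\alpha > 1$.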

Theorem 2. If the processing of list $l$ at $t_i$ at speed $c(u)$ fits completely to $t_j$ to two different speeds
or more, we save more energy if we reschedule the list to the speed which is closer to $\sqrt[\alpha-1]{\frac{1}{\alpha}} \cdot c(u)$,
and when $\alpha$ is 2 it simplifies to $\frac{c(u)}{2}$.

Proof. The whole processing of $l$ must not change. So $t'_i$ is the time that the list will remain on
speed $c(u)$ and can be calculated by the equation $t'_i \cdot c(u) + t_j \cdot c(v) = t_i \cdot c(u)$. So the energy that we
spend if we do not use $c(v)$ is $E_{start} = c(u)^{\alpha} \cdot t_i$ and if we use $c(v)$ is $E_{c(v)} = c(u)^{\alpha} \cdot t'_i + c(v)^{\alpha} \cdot t_j$.
So $E_{c(v)} = t'_i \cdot c(u)^{\alpha} + t_j \cdot c(v)^{\alpha} = E_{start} - c(v) \cdot t_j \cdot \left(c(u)^{\alpha-1} - c(v)^{\alpha-1}\right)$ and because $c(u) > c(v)$
we always save energy if we reschedule any task to a lower speed. The minimum energy occurs
when the derivative with respect to $c(v)$ equals zero. That happens when
$\left(t_j \cdot c(v) \cdot c(u)^{\alpha-1} - t_j \cdot c(v)^{\alpha}\right)' = 0 \Rightarrow t_j \cdot c(u)^{\alpha-1} = \alpha \, t_j \cdot c(v)^{\alpha-1} \Rightarrow c(v) = \sqrt[\alpha-1]{\frac{1}{\alpha}} \cdot c(u)$.
Now if the fragment of the list can be
reassigned to an even smaller $c(z)$ we obtain an even smaller energy schedule. Thus we try to fill
all the holes starting from lower speeds and going upwards, in order to prevent total fragmentation
of the whole schedule and obtain a schedule of nearly optimal energy consumption on the condition
of unharmed makespan. □
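The closed form in Theorem 2 can be illustrated with a short Python sketch. The helper names (`energy_saved`, `best_target_speed`, `closest_speed`) and the candidate speeds are hypothetical, and the hole length $t_j$ is assumed to be the same for every candidate speed.

```python
def energy_saved(c_u, c_v, t_j, alpha):
    """E_start - E_c(v) = c(v) * t_j * (c(u)^(alpha-1) - c(v)^(alpha-1))."""
    return c_v * t_j * (c_u ** (alpha - 1) - c_v ** (alpha - 1))


def best_target_speed(c_u, alpha):
    """Speed c(v) at which the derivative of the saving vanishes."""
    return (1.0 / alpha) ** (1.0 / (alpha - 1)) * c_u


def closest_speed(candidates, c_u, alpha):
    """Pick the available speed closest to the target of Theorem 2."""
    target = best_target_speed(c_u, alpha)
    return min(candidates, key=lambda c: abs(c - target))


# Hypothetical instance: c(u) = 4, alpha = 2, holes available at three speeds.
candidates = [1.0, 2.5, 3.5]
chosen = closest_speed(candidates, c_u=4.0, alpha=2.0)
best_by_saving = max(candidates, key=lambda c: energy_saved(4.0, c, 1.0, 2.0))
```

For $\alpha = 2$ the target is $c(u)/2 = 2$, so the speed 2.5 is chosen, and it is also the candidate with the largest saving in this instance.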

The algorithm "Save-Energy" clearly does not increase the makespan since it does not delay
the processing of any task; instead there may even be a reduction of the makespan. The new hole
has size $\frac{c(v)}{c(u)} < 1$ of the previous size and, in every execution, a hole that can be filled goes to a
faster processor.

In arbitrary DAGs the problem is that due to precedence constraints we cannot swap two time
intervals. To overcome this problem we proceed as follows: we define the supported set ($ST_i$) to
be all the tasks that have been completed until time $t_i$ as well as those currently running and those
that are ready to run. Between two intervals that have the same ST we can swap or reschedule any
assignment, so we run the above algorithm between all of these marked time intervals distinctly to
create local optima. In this case the complexity of the algorithm reduces to $O(m^2 \cdot \sum_i \theta_i)$, where $\theta_i$
is the time between two time intervals with different ST, while for lists of tasks the time complexity
is $O(\tau_0 \cdot m^2)$. We note that in general DAGs the best scheduling algorithms for distinct speeds
produce an $O(\log K)$-approximation (where $K$ is the number of essential speeds). In cases where
schedules are far from tight, the energy reduction that can be achieved is high.

3 Time Efficiency of Scheduling on Asymmetric Multiprocessors

We continue by providing some arguments for using asymmetric multiprocessors in terms of time 
efficiency. We show that preemptive scheduling in an asymmetric multiprocessor platform achieves 
the same or better optimal makespan than in a symmetric multiprocessor platform. The basic char- 
acteristic of our approach is speed asymmetry. We assume that the overhead of (re)assigning pro- 
cessors to tasks of a parallel job to be executed is negligible. 

Theorem 3. Given any list L of r chains of tasks to be scheduled on preemptive machines, an 
asymmetric multiprocessor system will always have a better or equal optimal makespan than a 
symmetric one, given that both have the same average speed (s') and the same total number of 
processors (m). The equality holds if during the whole schedule all processors are busy. 

Proof. Again we start by sorting the processors according to the processing capability $p_1, \ldots, p_m$
so that $c(1) \ge c(2) \ge \ldots \ge c(m)$. We then split time in intervals $t_j$, where $j \in [1 \ldots m]$, so that
within these $m$ intervals there is not any preemption, no task completes and no changes are made
to the precedence constraints. This is feasible since the optimal schedule is feasible and has finitely
many preemptions.

Let $OPT_\sigma$ be the optimal schedule for the symmetric multiprocessor system. Now consider the
interval $(t_i, t_{i+1})$ where all processors process a list and divide it in $m$ time intervals. We assign
each list to each of the $m$ asymmetric processors that are active, so that a task is assigned sequentially
to all processors in the original schedule of $OPT_\sigma$. So each task will be processed by every
processor for $\frac{1}{m} \cdot (t_{i+1} - t_i)$ time. Thus every task will have been processed during $(t_i, t_{i+1})$ with an
average speed of $\frac{\sum_{k=1}^{m} c(k)}{m}$, which is the speed of every symmetric processor. Thus, given an optimal
schedule for the symmetric system, we can produce one that has at most the same makespan on the
asymmetric set of processors.

The above is true when all processors are processing a list, at all times. Then the processing in
both cases is the same. Of course there are instances of sets of lists that cannot be made to have
all processors running at all times. In such schedules the optimal makespan on the asymmetric
platform is better. Recall that we have sorted all speeds. Since the system is asymmetric it must
have at least 2 speeds. If at any time of $OPT_\sigma$ we process fewer lists than processors, following
the analysis above, we will have to divide the time in (number of lists processing) $< m$ (denoted
by $\lambda$). So during time-interval $[t_i, t_{i+1})$ the processing of any list that is processed on symmetric
systems will be $s' \cdot (t_{i+1} - t_i)$, while for the asymmetric system the processing speed for the
same time-interval will be $\frac{\sum_{k=1}^{\lambda} c(k)}{\lambda}$. Note that the average speed in the second expression is bigger than that of the
first. That is valid because we use only the fastest processors. More formally,
$c(1) \ge \frac{c(1) + c(2)}{2} \ge \ldots \ge \frac{c(1) + \ldots + c(\lambda)}{\lambda} \ge \ldots \ge \frac{c(1) + \ldots + c(m)}{m} = s'$.
So we produced a schedule that has a better makespan than
$OPT_\sigma$. In other words, if during the optimal schedule for a symmetric system there exists at least
one interval where a processor is idle, we can produce an optimal schedule for the asymmetric
multiprocessor platform with smaller makespan. □
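The rotation argument of this proof can be made concrete. The following Python sketch is an illustration under stated assumptions: the function names and the cyclic-shift rotation are one possible way to let every task visit every processor for an equal slice of the interval, and the speeds are hypothetical.

```python
from fractions import Fraction


def work_per_task_with_rotation(speeds, duration):
    """Each of the m tasks visits every processor for duration/m time,
    so every task receives (sum of speeds / m) * duration units of work."""
    m = len(speeds)
    slice_len = Fraction(duration) / m
    work = [Fraction(0)] * m
    for step in range(m):                 # m equal time slices
        for task in range(m):
            proc = (task + step) % m      # cyclic shift: one task per processor
            work[task] += speeds[proc] * slice_len
    return work


def prefix_averages(speeds):
    """Average speed of the k fastest processors, for k = 1..m."""
    ordered = sorted(speeds, reverse=True)
    out, total = [], 0
    for k, c in enumerate(ordered, start=1):
        total += c
        out.append(Fraction(total, k))
    return out


# Hypothetical platform: speeds 4, 1, 1 (average speed s' = 2), interval length 3.
work = work_per_task_with_rotation([4, 1, 1], 3)
averages = prefix_averages([4, 1, 1])
```

Every task receives $s' \cdot 3 = 6$ units of work, as on the symmetric platform, while the prefix averages $4, \frac{5}{2}, 2$ show that using only the $\lambda < m$ fastest processors gives a strictly higher average speed here.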

Theorem 4. Given any DAG $G$ of tasks to be scheduled on preemptive machines, an asymmetric
multiprocessor system will always have a better or equal optimal makespan than a symmetric one,
provided that both have the same average speed ($s'$) and the same total number of processors ($m$).
The equality holds if during the whole schedule all processors are busy.

Proof. We proceed as above. The difference is that we split time in $(t_1, t_2, \ldots, t_m)$ that have the
following property: between any of these times $(t_i, t_{i+1})$ there is not any preemption on processors
or completion of a list or support for any list that we could not process at $t_i$ due to
precedence constraints.

When all processors are processing a list, at all times, the processing in both asymmetric and
symmetric systems is the same, i.e., $m \cdot s' \cdot (t_{i+1} - t_i)$. Of course there are DAGs that cannot
be made to have all processors running at all times due to precedence constraints or due to lack
of tasks. In such DAGs the optimal makespan on the asymmetric system is better than that of the
symmetric one. If at any time of $OPT_\sigma$ we process fewer lists than processors, following the analysis
of Theorem 3 we have that on the symmetric system the total processing will be
$\sum_{j=1}^{\lambda} s' \cdot (t_{i+1} - t_i) = \lambda \cdot s' \cdot (t_{i+1} - t_i)$, while the total processing during the same interval on the asymmetric one
will be $\sum_{j=1}^{\lambda} c(j) \cdot (t_{i+1} - t_i)$, which is better. □

4 Multiprocessor Systems of Limited Asymmetry 

We now focus on the case where the multiprocessor system is composed of a single fast processor 
and multiple slow ones, like the one designed in [6]. Consider that the fast processor has speed s 
and the remaining m — 1 processors have speed 1 . In the sequel preemption of tasks is not allowed. 

We design the non-preemptive algorithm "Remnants" (see Alg. 2) that always gives schedules
with makespan $T \le T_{opt} + \frac{1}{s}$. We greedily assign the fast processor first in each round. Then
we try to maximize parallelism using the slow processors. In the beginning of round $k$ we denote by
$rem_k(i)$ the suffix of list $L_i$ not yet done. Let $R_k(i) = |rem_k(i)|$. For $n$ tasks, the algorithm can
be implemented to run in $O(n^2 \log n)$ time. The slow processors, whose "list" is taken by the
speedy processor in round $k$, can be reassigned to free remnants. Note that in the speed assignment
produced by "Remnants" we can even name the processors assigned to tasks (in contrast to general
speed assignment methods, see e.g., [8, 5, 4]). Thus the actual scheduling of tasks is much
easier and of reduced overhead.

As an example, consider a system with 3 processors ($m = 3$) where the speedy processor
has $s = 4$. In other words, we have a fast processor and two slow ones. We wish to schedule 4
lists, where $l_1 = 3$, $l_2 = 3$, $l_3 = 2$ and $l_4 = 2$. The "Remnants" algorithm produces the following
assignment with a makespan of $T = 2$:

[Figure: assignment produced by "Remnants" for s = 4, showing Round 1 and Round 2.]



Input: Lists $L_1, \ldots, L_r$ of tasks
Output: An assignment of tasks to processors

$k \leftarrow 1$
while there are nonempty lists do
    for $i \leftarrow 1$ to $r$ do $rem_k(i) \leftarrow L_i$
    $g_k \leftarrow$ number of nonempty lists
    Sort and rename the remnants so that $R_k(1) \ge R_k(2) \ge \ldots \ge R_k(g_k)$
    $u \leftarrow s$, $v \leftarrow 1$
    /* Assign the fast processor sequentially to s tasks */
    while $u > 0$ and $v \le g_k$ do
        $p \leftarrow \min(u, R_k(v))$
        Assign $p$ tasks of $rem_k(v)$ to fast processor and remove from $rem_k(v)$
        $u \leftarrow u - p$, $v \leftarrow v + 1$
    end
    /* Assign slow processors to beginning task of each remnant list not touched
       by the fast speed assignment */
    if $v \le g_k$ then
        $q \leftarrow \min(g_k, m - 1)$
        for $w \leftarrow v$ to $q$ do
            Assign first task of $rem_k(w)$ to slow processor and remove from $rem_k(w)$
        end
    end
    Remove assigned tasks from the lists
    $k \leftarrow k + 1$
end

Algorithm 2: "Remnants"



Notice that the slow processors, whose "list" is taken by the speedy processor in round k, can be 
reassigned to free remnants (one per free remnant). So our assignment tries to use all available 
parallelism per round. 

Now consider the case where the fast processor has $s = 3$, that is, it runs slower than the
processor of the above example. For the same lists of tasks, the algorithm now produces a schedule
with a makespan of $T = 2 + \frac{1}{3}$:



[Figure: assignment produced by "Remnants" for s = 3, showing Round 1 and Round 2.]



Notice that for this configuration, the following schedule produces a makespan of 2: 



In the following theorem we show that the performance of Remnants is actually very close to 
optimal, in the sense of arguing that the above counter-example is essentially the only one. 
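The two worked examples can be reproduced with a small simulation. The Python sketch below is an interpretation of "Remnants" rather than a definitive implementation: it assumes an integer speed $s$, unit-time rounds, and that in each round every slow processor serves one remnant not touched by the fast processor (at most $m - 1$ of them). The function name `remnants_makespan` is hypothetical.

```python
from fractions import Fraction


def remnants_makespan(lengths, m, s):
    """Simulate unit-time rounds; return the makespan as a Fraction.

    lengths -- chain lengths l_1..l_r (unit tasks)
    m       -- total processors (one fast, m - 1 slow)
    s       -- integer speed of the fast processor
    """
    rem = [l for l in lengths if l > 0]
    makespan = Fraction(0)
    while rem:
        rem.sort(reverse=True)            # R_k(1) >= R_k(2) >= ...
        u, v = s, 0                       # u: fast capacity left this round
        while u > 0 and v < len(rem):
            p = min(u, rem[v])
            rem[v] -= p                   # fast processor takes p tasks
            u -= p
            v += 1
        # Assumption: each slow processor serves one remnant not touched
        # by the fast processor, at most m - 1 of them per round.
        slow_used = 0
        for w in range(v, len(rem)):
            if slow_used == m - 1:
                break
            rem[w] -= 1
            slow_used += 1
        fast_time = Fraction(s - u, s)    # time the fast processor was busy
        makespan += max(fast_time, Fraction(1) if slow_used else Fraction(0))
        rem = [x for x in rem if x > 0]
    return makespan


# The instance of the text: lists of lengths 3, 3, 2, 2 on m = 3 processors.
t_fast4 = remnants_makespan([3, 3, 2, 2], m=3, s=4)
t_fast3 = remnants_makespan([3, 3, 2, 2], m=3, s=3)
```

Under these assumptions the simulation gives a makespan of 2 for $s = 4$ and $2 + \frac{1}{3}$ for $s = 3$, matching the two examples above.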



Theorem 5. For any set of lists $L$ and multiprocessor platform with one fast processor of speed
$s$ and $m - 1$ slow processors of speed 1, if $T$ is the makespan of Algorithm Remnants then
$T \le T_{opt} + \frac{1}{s}$.

Proof. We apply here the construction of Graham, as it was modified by [5], which we use in
order to see if $T$ can be improved. Let $j_1$ be a task that completes last in Remnants. Without loss
of generality, from the way Remnants works, we can always assume that $j_1$ was executed by the
speedy processor. We consider now the logical chain ending with $j_1$ as follows: iteratively define
$j_{t+1}$ as a predecessor of $j_t$ that completes last of all predecessors of $j_t$ in Remnants. In this chain
(a) either all its tasks were done at speed $s$ (in which case, and since the fast processor works all
the time, the makespan $T$ of Remnants is optimal), or (b) there is a task $t_1$ at distance at most $s - 1$
from $j_1$ that was done at speed 1 in Remnants. In the latter case, if $x$ is the start time of $t_1$, this
means that before $x$ all speed 1 processors are busy, else $t_1$ could have been scheduled earlier.

(b.1) If there is no other task in the chain done at speed 1 and before $t_1$ then again $T$ is optimal since
before $t_1$ all processors of all speeds are busy.

(b.2) Let $t_2$ be another task in the chain done at speed 1 and $t_2 < t_1$. Then $t_2$ must be an immediate
predecessor of $t_1$ in a chain (because of the way Remnants works) and, during the execution of
$t_2$, speed $s$ is busy but there could be some processor of speed 1 available. Define $t_3, \ldots, t_j$
similarly (tasks of the last chain, all done at speed 1 and $t_k < t_{k-1}$, $k = j, \ldots, 2$). This can go
up to the chain's start, which could have been done earlier by another speed 1 processor and
this is the only task that could be done by an available processor, just one step before. So, the
makespan $T$ of algorithm Remnants can be compressed by only one task, and become optimal.
But then $T \le T_{opt} + \frac{1}{s}$ (i.e., it is the start of the last list that has no predecessor and which could
go at speed 1 together with nodes in the previous list).

□

4.1 An LP-relaxation approach for a schedule of good expected makespan 

In this section we relax the limitations to asymmetry. We work on the more general case of having
$m_s$ fast processors of speed $s$ and $m - m_s$ slow processors of speed 1. Note that we still have
two distinct speeds and preemption of tasks is not allowed. We follow the basic ideas of [4] and
specialize the general lower bounds on makespan for the more general case. Clearly, the maximum
rate at which the multiprocessor system of limited asymmetry can process tasks is
$m_s \cdot s + (m - m_s) \cdot 1$, which is achieved if and only if all machines are busy. Therefore to finish all $n$ tasks
requires time at least $A = \frac{n}{m_s \cdot s + m - m_s}$. Now let

$$B = \max_{1 \le j \le \min(r,m)} \frac{\sum_{i=1}^{j} l_i}{\sum_{i=1}^{j} c(i)}$$

where $c(1) = \ldots = c(m_s) = s$ and $c(m_s + 1) = \ldots = c(m) = 1$ are the individual processor
speeds from the fast to the slow. It follows that,

$$\frac{l_1}{s} \ge \frac{l_1 + l_2}{2s} \ge \ldots \ge \frac{l_1 + \ldots + l_j}{js} \qquad (j \le m_s)$$

The interesting case is when $m_s < r$. So, we assume $m_s < r$ and let $l_s = l_1 + l_2 + \cdots + l_{m_s}$. Thus

$$B = \max_{m_s + 1 \le j \le \min(r,m)} \left\{ \frac{l_s + \sum_{i=m_s+1}^{j} l_i}{m_s \cdot (s - 1) + j} \right\}$$

By [4] we then have:

Lemma 1. Let $T_{opt}$ be the optimal makespan of $r$ chains. Then $T_{opt} \ge \max(A, B)$.

Since the average load is also a lower bound for preemptive schedules we get

Corollary 1. $\max(A, B)$ is also a lower bound for preemptive schedules.
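The two lower bounds can be computed directly from the chain lengths. The Python sketch below is illustrative and assumes an integer speed $s$; the function name `lower_bounds` is hypothetical, and $B$ is evaluated from its general definition as a maximum over prefixes of the sorted chains and the sorted speeds.

```python
from fractions import Fraction


def lower_bounds(lengths, m, m_s, s):
    """Return (A, B) for m_s fast processors of speed s and m - m_s of speed 1."""
    n = sum(lengths)
    speeds = [s] * m_s + [1] * (m - m_s)      # c(1) >= ... >= c(m)
    A = Fraction(n, m_s * s + (m - m_s))      # total work / total rate
    ordered = sorted(lengths, reverse=True)   # l_1 >= l_2 >= ... >= l_r
    B = Fraction(0)
    work = cap = 0
    for j in range(min(len(ordered), m)):
        work += ordered[j]                    # l_1 + ... + l_j
        cap += speeds[j]                      # c(1) + ... + c(j)
        B = max(B, Fraction(work, cap))
    return A, B


# Hypothetical instance: lists of lengths 3, 3, 2, 2; m = 3, one fast processor, s = 3.
A, B = lower_bounds([3, 3, 2, 2], m=3, m_s=1, s=3)
```

For this instance $A = 2$ and $B = \frac{8}{5}$, so Lemma 1 gives $T_{opt} \ge 2$.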

As for the case where we have only one fast processor, i.e. $m_s = 1$, in each step at most
$s + \min(m, r - 1)$ tasks can be done since no two processors can work in parallel on the same list.
This gives $T_{opt} \ge \frac{n}{s + \min(r - 1, m)}$. Of course the bound $T_{opt} \ge B$ still also holds.

A natural variant of list scheduling where no preemption takes place, called speed-based
list scheduling and developed in [5], is constrained to schedule according to the speed assignments of
the jobs. In classical list scheduling, whenever a machine is free the first available job from the list
is scheduled on it. In this method, an available task is scheduled on a free machine provided that
the speed of the free machine matches the speed assignment of the task. The speed assignments
of tasks have to be done in a clever way for good schedules. In the sequel, let $D_s = \frac{1}{m_s \cdot s} \cdot n_s$,
where $n_s \le n$ is the number of tasks assigned to speed $s$. Let $D_1 = \frac{n - n_s}{m - m_s}$. Finally, for each chain
$L_i$ and each task $j \in L_i$ with $c(j)$ being the speed assigned to $j$, compute $q_i = \sum_{j \in L_i} \frac{1}{c(j)}$ and
let $C = \max_{L_i \in L} q_i$. The proof of the following theorem follows from an easy generalization of
Graham's analysis of list scheduling.

Theorem 6 (specialization of Theorem 2.1, [5]). For any speed assignment ($c(j) = s$ or 1) to
tasks $j = 1 \ldots n$, the non-preemptive speed-based list scheduling method produces a schedule of
makespan $T \le C + D_s + D_1$.
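The bound of Theorem 6 is easy to evaluate for a given speed assignment. The following Python sketch is illustrative: it assumes an integer $s > 1$ and $m > m_s$, and encodes each chain as the list of speeds assigned to its tasks; the name `speed_based_bound` is hypothetical.

```python
from fractions import Fraction


def speed_based_bound(chains, m, m_s, s):
    """Return C + D_s + D_1 for a speed assignment.

    chains -- list of chains; each chain is the list of speeds (s or 1)
              assigned to its tasks, in precedence order
    """
    n = sum(len(ch) for ch in chains)
    n_s = sum(1 for ch in chains for c in ch if c == s)   # tasks at speed s
    D_s = Fraction(n_s, m_s * s)                          # load of fast processors
    D_1 = Fraction(n - n_s, m - m_s)                      # load of slow processors
    C = max(sum(Fraction(1, c) for c in ch) for ch in chains)  # longest chain time
    return C + D_s + D_1


# Hypothetical assignment for lists of lengths 3, 3, 2, 2 with m = 3, m_s = 1, s = 3:
# the first and last chains run on the fast processor, the others on slow ones.
bound = speed_based_bound([[3, 3, 3], [1, 1, 1], [1, 1], [3, 3]], m=3, m_s=1, s=3)
```

Here $C = 3$, $D_s = \frac{5}{3}$ and $D_1 = \frac{5}{2}$, so the bound evaluates to $\frac{43}{6}$; a better speed assignment would lower it.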

Based on the above specializations, we wish to provide a non-preemptive schedule (i.e., speed
assignment) that achieves good makespan. We either assign tasks to speed $s$ or to speed 1 so that
$C + D_s + D_1$ is not too large. Let, for task $j$:

$$x_j = \begin{cases} 1 & \text{when } c(j) = s \\ 0 & \text{otherwise} \end{cases}
\qquad \text{and} \qquad
y_j = \begin{cases} 1 & \text{when } c(j) = 1 \\ 0 & \text{otherwise} \end{cases}$$

Since each task $j$ must be assigned to some speed we get

$$\forall j = 1 \ldots n: \quad x_j + y_j = 1 \qquad (1)$$

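In time $D$, the fast processors can complete $\sum_{j=1}^{n} x_j$ tasks and the slow processors can complete
$\sum_{j=1}^{n} y_j$ tasks. So

$$\frac{\sum_{j=1}^{n} x_j}{m_s \cdot s} \le D \qquad (2)$$

and

$$\frac{\sum_{j=1}^{n} y_j}{m - m_s} \le D \qquad (3)$$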



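Let $t_j$ be the completion time of task $j$:

$$t_j \ge 0 \qquad (4)$$

If $j' \prec j$ then clearly

$$\frac{x_j}{s} + y_j \le t_j - t_{j'} \qquad (5)$$

Also

$$\forall j: \quad t_j \le D \qquad (6)$$

and

$$\forall j: \quad x_j, y_j \in \{0, 1\} \qquad (7)$$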

Based on the above constraints, consider the following mixed integer program:

MIP:
    min $D$
    under (1) to (7)
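To make the constraints concrete, the following Python sketch checks whether a candidate solution satisfies (1) to (7). It is only a feasibility checker, not a solver; the function name `satisfies_mip` and the predecessor-list encoding `preds` are assumptions made for illustration.

```python
def satisfies_mip(x, y, t, D, preds, m, m_s, s):
    """Check constraints (1)-(7) for a candidate solution.

    x, y, t -- lists indexed by task j
    preds   -- preds[j] lists the predecessors j' of task j
    """
    n = len(x)
    ok = all(x[j] + y[j] == 1 for j in range(n))                    # (1)
    ok = ok and sum(x) / (m_s * s) <= D                               # (2)
    ok = ok and sum(y) / (m - m_s) <= D                               # (3)
    ok = ok and all(t[j] >= 0 for j in range(n))                      # (4)
    ok = ok and all(x[j] / s + y[j] <= t[j] - t[jp]
                    for j in range(n) for jp in preds[j])             # (5)
    ok = ok and all(t[j] <= D for j in range(n))                      # (6)
    ok = ok and all(v in (0, 1) for v in x + y)                       # (7)
    return ok


# Hypothetical instance: one chain of two unit tasks, m = 2, m_s = 1, s = 2,
# both tasks assigned to the fast processor.
preds = [[], [0]]
feasible = satisfies_mip([1, 1], [0, 0], [0.5, 1.0], 1.0, preds, m=2, m_s=1, s=2)
too_tight = satisfies_mip([1, 1], [0, 0], [0.5, 1.0], 0.9, preds, m=2, m_s=1, s=2)
```

With $D = 1$ the candidate is feasible; with $D = 0.9$ constraint (2) fails, since the fast processor needs one full time unit for two tasks at speed 2.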



MIP's optimal solution is clearly a lower bound on $T_{opt}$. Note that (2) $\Rightarrow D_s \le D$ and (3) $\Rightarrow D_1 \le D$.
Also note that since $t_{j'} \ge 0$, we have $\frac{x_j}{s} + y_j \le t_j$ by (5) and thus also $C \le D$, by adding times
on each chain. So, if we could solve MIP then we would get a schedule of makespan $T \le 3 \cdot T_{opt}$,
by Theorem 6. Suppose we relax (7) as follows:

$$x_j, y_j \in [0, 1], \quad j = 1 \ldots n \qquad (8)$$

Consider the following linear program:

LP:
    min $D$
    under (1) to (6) and (8)



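This LP can be solved in polynomial time and its optimal solution $\bar{x}_j, \bar{y}_j, \bar{t}_j$, where $j = 1 \ldots n$,
gives an optimal $\bar{D}$; also $\bar{D} \le T_{opt}$ (because $\bar{D} \le$ the best $D$ of MIP).
We now use randomized rounding to get a speed assignment $A_1$:

$$\forall j: \quad c(j) = s \text{ with probability } \bar{x}_j, \qquad c(j) = 1 \text{ with probability } 1 - \bar{x}_j = \bar{y}_j$$

Let $T_{A_1}$ be the makespan of $A_1$. Since $T_{A_1} \le C + D_s + D_1$, we have $E(T_{A_1}) \le E(C) + E(D_s) + E(D_1)$.
But note that

$$E(n_s) = \sum_{j=1}^{n} \bar{x}_j \qquad \text{and} \qquad E(n - n_s) = \sum_{j=1}^{n} \bar{y}_j$$

so $E(D_s), E(D_1) \le \bar{D}$ by (2), (3) and, for each list $L_i$,

$$E(q_i) = \sum_{j \in L_i} \left( \frac{\bar{x}_j}{s} + \bar{y}_j \right) \le \bar{D}$$

i.e., $E(C) \le \bar{D}$. So we get the following theorem: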

Theorem 7. Our speed assignment $A_1$ gives a non-preemptive schedule of expected makespan at
most $3 \cdot T_{opt}$.
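The randomized rounding step can be sketched in a few lines of Python. This is a minimal illustration, assuming the fractional LP values are already available as a list; the function name `randomized_rounding`, the parameter `x_bar` and the sample values are hypothetical, and no LP solver is included.

```python
import random


def randomized_rounding(x_bar, s, rng):
    """Assign speed s to task j with probability x_bar[j], else speed 1."""
    return [s if rng.random() < x_bar[j] else 1 for j in range(len(x_bar))]


# Hypothetical fractional solution for three tasks; the fixed seed only
# makes the sketch reproducible.
rng = random.Random(0)
speeds = randomized_rounding([1.0, 0.0, 0.5], s=3, rng=rng)
```

A task with fractional value 1 is always assigned to speed $s$ and a task with value 0 always to speed 1; every other task is assigned independently at random.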



Our MIP formulation also holds for general DAGs and 2 speeds, when all tasks are of unit length.
Since Theorem 6 of [5] and the lower bound of [4] also hold for general DAGs, we get:

Corollary 2. Our speed assignment $A_1$, for general DAGs of unit tasks, gives a non-preemptive
schedule of expected makespan at most $3 \cdot T_{opt}$.

We continue by making some special considerations for lists of tasks, that is, we think about DAGs
that are decomposed in sets of lists. Then, $A_1$ can be greedily improved since all tasks are of unit
processing time, as follows. After doing the assignment experiment for the nodes of a list $L_i$ we
get $l_i^s$ nodes on the fast processors and $l_i^1$ nodes on slow processors. We then reassign the first $l_i^s$
nodes of $L_i$ to the fast processors and the remaining nodes of $L_i$ to the slow processors. Clearly
this does not change any of the expectations of $D_s$, $D_1$ and $C$. Let $\tilde{A}_1$ be this modified (improved)
schedule.
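The greedy improvement only reorders the speeds inside a list. A minimal Python sketch, with the hypothetical name `front_load_fast` and a list of assigned speeds as input, could look as follows.

```python
def front_load_fast(assignment, s):
    """Keep the number of fast tasks of a list but move them to its front.

    assignment -- speeds (s or 1) assigned to the tasks of one list, in order
    """
    fast = sum(1 for c in assignment if c == s)        # tasks rounded to speed s
    return [s] * fast + [1] * (len(assignment) - fast)


# Hypothetical rounded assignment for a list of five unit tasks, s = 3.
improved = front_load_fast([1, 3, 1, 3, 3], s=3)
```

The number of tasks at each speed is unchanged, so the chain time $q_i$ and the loads $D_s$, $D_1$ are the same as before the reordering.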

Also, because all tasks are of equal length (unit processing time), any reordering of them in the
same list will not change the optimal solution of LP. But then, for each list $L_i$ and for each task
$j \in L_i$, $\bar{x}_j$ is the same (call it $\bar{x}_i$), and the same holds for $\bar{y}_j$. Then the processing time of $L_i$ is just
$\frac{f_i}{s} + (l_i - f_i)$, where $f_i$ is distributed as the binomial $B(l_i, \bar{x}_i)$.

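In the sequel, let $\forall i: l_i \ge \gamma \cdot n$, for some $\gamma \in (0, 1)$, and let $s \cdot m = o(n) = n^{\epsilon}$, where $\epsilon < 1$.
Then from $\tilde{A}_1$ we produce the speed assignment $A_2$ as follows:

foreach list $L_i$, $i = 1 \ldots r$ do
    if [...] then
        assign all the nodes of $L_i$ to unit speed
    else
        for $L_i$, $A_2 = \tilde{A}_1$
    end
end

Since for the makespan $T_{\tilde{A}_1}$ of $\tilde{A}_1$ we have

$$E(T_{\tilde{A}_1}) = E(T_{A_1}) \le 3 \cdot T_{opt}$$

we get

$$E(T_{A_2}) \le 3 \cdot T_{opt} + s \log n$$

But

$$T_{opt} \ge \frac{n}{s \cdot m_s + (m - m_s)} \ge \frac{n}{s m} = n^{1 - \epsilon}$$

Thus

$$E(T_{A_2}) \le (3 + o(1)) \, T_{opt}$$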

However, in $A_2$, the probability that $T_{A_2} > E(T_{A_2}) (1 + \beta)$, where $\beta$ is a constant in $(0, 1)$, is,
by Chernoff bounds, exponentially small in $l_i \cdot \bar{x}_i$. This implies that it is enough to
repeat the randomized assignment of speeds at most a polynomial number of times and get a
schedule of actual makespan at most $(3 + o(1)) \, T_{opt}$. So, we get our next theorem:

Theorem 8. When each list has length $l_i \ge \gamma \cdot n$ (where $\gamma \in (0, 1)$) and $s \cdot m = n^{\epsilon}$ (where $\epsilon < 1$)
then we get a (deterministic) schedule of actual makespan at most $(3 + o(1)) \, T_{opt}$ in expected
polynomial time.



5 Conclusions and Future work 

Processor technology is undergoing a vigorous shake-up to enable low-cost multiprocessor
platforms where individual processors have different computation capabilities. We examined the
energy consumption of such asymmetric architectures. We presented the preemptive algorithm
"Save-Energy" that post-processes a schedule of tasks to reduce the energy usage without any
deterioration of the makespan. Then we examined the time efficiency of such asymmetric architectures. We
showed that preemptive scheduling in an asymmetric multiprocessor platform can achieve the same
or better optimal makespan than in a symmetric multiprocessor platform.

Motivated by real multiprocessor systems developed in [6, 10], we investigated the special case
where the system is composed of a single fast processor and multiple slow processors. We say that
these architectures have limited asymmetry. Interestingly, although the problem of scheduling has
been studied extensively in the field of parallel computing and scheduling theory, it was considered
for the general case where multiprocessor platforms have K distinct speeds. Our work attempts to
bridge between the assumptions in these fields and recent advances in multiprocessor systems
technology. In our simple, yet realistic, model where K = 2, we presented the non-preemptive
algorithm "Remnants" that achieves almost optimal makespan.

We then generalized the limited asymmetry to systems that have more than one fast processor
while K = 2. We refined the scheduling policy of [5] and gave a non-preemptive, speed-based,
randomized list scheduling of DAGs that has a makespan T whose expectation satisfies E(T) ≤ 3 · OPT. This
improves the previous best factor (6 for two speeds). We then showed how to convert the schedule
into a deterministic one (in polynomial expected time) in the case of long lists.

Regarding future work we wish to examine trade-offs between makespan and energy and we 
also wish to investigate extensions for our model allowing other aspects of heterogeneity as well. 